Project Report

Team Too-pullz

Authors

Lila Weiner, Gabrielle Bliss, Riley Otsuki, and Kenneth Yeon

Published

March 15, 2023

Abstract
This project aims to predict the gross of a movie through multivariate linear regression to inform stakeholders about profitability, as some high-grossing films result in huge net losses. We also aim to predict, through multivariate logistic regression, whether a movie will be highly rated on the IMDB platform, to inform stakeholders of projected audience opinion. The two response variables of interest are predicted from a dataset containing predictors such as actor facebook likes, country of origin, release year, and production budget. Regression and logistic models for gross and IMDB score were created to maximize inference and prediction. A tentative logistic model classifying highly profitable movies was also explored for a more complete analysis through the lens of inference. Our regression model aims to minimize MAE, while the logistic model aims to optimize the precision and false positive rate (FPR) metrics. Lasso and forward stepwise selection were used to further optimize variable selection and inference after initial analysis and transformations of categorical and numerical predictors. Greater actor popularity and slightly longer film durations seemed to contribute most to both high gross and high ratings, while budget had a more significant impact in predicting gross.

1 Background / Motivation

The economics of Hollywood and the international film industry are not always as straightforward as they seem. Movies that gross highly are not always profitable, even if they make more money than their production budget. For example, “Harry Potter and the Order of the Phoenix” ended up with a $167 million loss even though it is one of the top-grossing films of the 2010s [1]. Thus, actors, producers, and investing studios whose salaries and pay are tied to the profitability of movies are at risk. The profitability of a movie takes into account many expenses beyond production budget, and predicting it beforehand would be beneficial in order to allocate money accordingly [2].

Additionally, over the past several years, the movie industry has undergone other immense changes. Streaming services have become the primary mode of movie watching for viewers, at the expense of the in-person movie theater experience. Amidst these changes in the industry, movie makers are faced with uncertainty regarding the success of their product—in the financial sense, but also in the sense of critics and public opinion. Our group was interested in performing an analysis on film data, as we all enjoy watching movies, but we were also concerned with mitigating these risks regarding gross, profitability, and assessment of success for the concerned parties. Upon finding our dataset on Kaggle, we wondered whether we could determine what cultural factors, if any, contribute to a movie’s success in terms of money and IMDB rating. Our resulting models (linear regression and logistic regression, respectively) for the prediction and inference of total film gross and IMDB rating aim to assess these cultural issues and to provide our team with more comprehensive insight into a topic that interests us. Because our models predict both quantitative and qualitative success, they can serve as a basis for decisions on how to increase gross or ratings, creating a more successful movie financially and culturally and ideally mitigating the film industry’s problems with unprofitable films and poor public reception.

2 Problem statement

We are working to identify and predict a movie’s success, financially and through the lens of public opinion, based on its relationship to a collection of predictors in our data. We considered gross income and IMDB rating as the two main metrics of success. Our problem is a combination of inference and prediction. For both metrics, we utilized inference to obtain an understanding of which variables are most pertinent when determining the success of a movie. Our problem also involves prediction, as one of our objectives is to predict the gross income that a movie would generate based on the possible predictor variables in the dataset, as described previously. Within our focus on IMDB ratings, we wanted our model to predict the probability that a movie is given a high rating, and then classify the movie as predicted to be highly rated by the public IMDB scores, or not. Since gross is a continuous variable, we aimed to predict it using a linear regression model. In contrast, although IMDB rating is numeric, it is better treated as a binary response: highly rated or not. Thus, a logistic regression model was used for this objective.

3 Data sources

Our model(s) are based on a dataset found on Kaggle, which contains approximately n = 5,000 observations, each corresponding to a different movie and its cultural characteristics as presented on the IMDB platform. Of these film observations, 3,756 have no missing values. Within our dataset, we identified two response variables: gross income and IMDB rating. There are 25 other variables in the dataset, which had the potential to be possible predictors in our models. These predictors include various facebook-like counts for the director, actor, and movie itself, the genres of the film, the aspect ratio, year of release, production budget, keywords, duration, country of origin, and so on. Notably, the dataset does not include comprehensive financial data on total film expenditure or a comprehensive set of predictors that might be more intuitively used to predict our responses. Therefore, we have framed our objectives as an attempt to predict and explain gross and IMDB rating with cultural predictors in order to mitigate losses and detriments that could occur to concerned parties if a film does not make enough money to be profitable or has a low success rate in terms of public opinion and reviews.

Link to our open-access dataset

4 Stakeholders

Our model(s) primarily serve to benefit movie directors, producers, and investors. These groups are our target stakeholders, as they are all involved in the creation of movies and have something at stake in regards to a film’s success—financially or in the eyes of public opinion. Movie directors and producers can use our findings to determine the amount and viability of their production budgets, as well as the ‘wiggle room’ they have to spend on marketing and other areas important to moviemaking to maximize success. These stakeholders also have their reputations at stake; a highly rated movie in the eyes of public opinion and critics is essential to furthering their careers. Private investors will also gain insights from our analyses, as they will obtain newfound guidance regarding the safety/risk in a potential movie investment. With a deeper understanding of what characteristics contribute to a successful movie, our stakeholders will be able to make well-informed decisions regarding casting, marketing, and overall movie creation. Although we are interested in prediction, we are also heavily concerned with inference, so that our stakeholders might know what cultural predictors could correspond to a higher success rate (whether in terms of gross or ratings) for their movie.

5 Data quality check / cleaning / preparation

Based on our initial data quality checks, we determined that most of our data preparation and cleaning would center around splitting categorical variables. Each movie had a genres column whose value was a single string listing every genre the movie was classified as. Thus, some movies had a value of ‘Drama|Horror|Thriller’ or ‘Action|Adventure|Fantasy|Sci-Fi’, while others had just ‘Documentary’. To remedy this, we found all the unique genres mentioned throughout the whole dataset, of which there were 26. We then split the genres column into 26 indicator columns, one per genre, marking whether or not the movie of interest was classified as that genre. Some movies were classified as multiple genres, and our cleaning code preserved these multi-genre classifications.
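The genre split described above can be sketched with pandas; the sample rows below are hypothetical stand-ins for the Kaggle data, not actual dataset values.

```python
import pandas as pd

# Hypothetical sample of the pipe-separated `genres` column
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "genres": ["Drama|Horror|Thriller",
               "Action|Adventure|Fantasy|Sci-Fi",
               "Documentary"],
})

# One 0/1 indicator column per unique genre; a movie tagged with
# several genres gets a 1 in each of its genre columns
genre_dummies = movies["genres"].str.get_dummies(sep="|")
movies = pd.concat([movies.drop(columns="genres"), genre_dummies], axis=1)
```

On the real dataset, the same `str.get_dummies(sep="|")` call would yield 26 genre columns.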

This method of splitting categorical variables was also notably applied to the country column of the dataset. The top two levels of this category, USA and UK, were made into columns (country_UK and country_USA), and the rest of the countries in the dataset were grouped into their own column, country_Others.

While we performed similar operations on other categorical variables, like plot_keywords, they did not become of major use in our models and the preparation was therefore not shown in the project code template.

The other columns created from the dataset were ones to use as the response variable for logistic regression. The IMDB rating column, originally a numeric score on a 1 to 10 scale, was split into a dummy variable for highly-rated movies: if the IMDB score was above the 75th percentile, the movie was classified as highly rated (1), and otherwise not (0).
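A minimal sketch of that percentile split, using hypothetical `imdb_score` values:

```python
import pandas as pd

# Hypothetical IMDB scores; the real dataset has ~5,000 rows
scores = pd.Series([5.5, 6.1, 6.8, 7.4, 8.2, 4.9, 7.9, 6.3],
                   name="imdb_score")

threshold = scores.quantile(0.75)                # 75th-percentile cutoff
highly_rated = (scores > threshold).astype(int)  # 1 = highly rated, 0 = not
```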

Finally, after all the transformations and interactions of variables found in EDA and Model Development, we performed the last steps of preparation, such as splitting the final dataset into 5 folds for K-fold cross-validation on our regression model and standardizing predictors for lasso regression. Through data preparation and EDA, we also found that some variables are only determined after a movie’s release and could therefore not be used in our predictive linear regression model. These included the total number of users who voted on the movie rating, the number of user reviews, and the number of critic reviews (num_voted_users, num_user_for_reviews, num_critic_for_reviews).

Further examples of these observations and all the distributions of variables can be seen in the necessary tables displaying the distribution of all variables, mentioned in the Appendices and also included in the Data quality check / cleaning / preparation section in the report code.

6 Exploratory data analysis

6.1 Linear Regression EDA

  • The variables representing user and critic reviews had the strongest correlations with gross, but were unable to be used in our linear regression problem because they are known after the movie’s release. Following these variables, we observed that facebook_likes for the actors, cast and director were all moderately correlated with gross, and thus, we plotted scatterplots to observe relationships.

  • duration could benefit from being binned before model development, as the distribution shows that different durations have different relationships with gross. Four bins were the minimum number that captured the general trend of gross vs. duration, which we will show in Model Development. The figure above shows the major iterations of bin testing for duration.

  • The majority of the movies in our dataset were created in the 21st century. gross values might be subject to other factors we do not have the predictors for, such as inflation, economic contexts of different countries, etc. Thus, title_year might be important to interact with other predictors in our final model.

  • Animation, Adventure, Family, Action, and Fantasy were the five highest-grossing genres. These were also the five genres with the strongest correlation to gross, as we will see in Model Development.

6.2 Logistic Regression EDA

  • Many movies have actors_facebook_likes values of 0, so the variable was dummied to indicate whether the cast is famous: movies in the top 30% of actors_facebook_likes were considered famous, and the rest were not.

  • Genre was initially thought to be a significant predictor based on the EDA above, but further EDA revealed that IMDB scores did not change significantly across different genres.
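The famous-cast dummy from the first bullet can be sketched with a quantile cutoff; the like counts below are hypothetical:

```python
import pandas as pd

# Hypothetical actors_facebook_likes values; many real rows are 0
likes = pd.Series([0, 0, 150, 3000, 0, 48000, 900, 12000, 0, 250])

cutoff = likes.quantile(0.70)                # top 30% counted as "famous"
famous_cast = (likes >= cutoff).astype(int)  # 1 = famous cast, 0 = not
```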

7 Approach

Our first objective was to develop a linear regression model to predict gross income. We decided that a linear model would be best suited to predict the amount of money that a movie would generate, gross, as this is a continuous variable, and for stakeholders to analyze their budgets and spending against this predicted metric, it would need to be a predicted number rather than a probability or classification. In this context, we aimed to minimize the test MAE of our model, as this metric treats larger and smaller errors with equal worth (so we would get a sense of the “average error” of our model). We did not feel the need to weight larger errors more heavily, as a greater loss is not exponentially worse than a smaller loss.

After conducting some preliminary research, we developed a basic understanding of the monetary conditions that qualify a movie as profitable. If a movie’s income exceeds its production budget, then the movie could technically be deemed profitable. However, there are other expenses that are not accounted for by budget, such as marketing promotions, and payments to corporations and studios. Hence, it is estimated that a movie is profitable when the gross income of a movie is three times the value of its production budget, which is the only other monetary variable we had access to through our dataset. This is the standard we utilized regarding our classification model for movie gross. We aimed to get our test MAE under the mean of gross itself, but kept in mind the metric for profitability while doing so, so that our model could be useful to stakeholders in the same sense. We wanted stakeholders to understand the possible relationships between our predictors and gross, but also be able to use the predicted gross as a baseline for determining monetary allocations in order to attempt to make money, rather than lose it.
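The 3x-budget profitability rule above can be expressed directly; the gross and budget figures below are hypothetical:

```python
import pandas as pd

# Hypothetical gross and production-budget figures, in dollars
films = pd.DataFrame({
    "gross":  [300_000_000, 90_000_000, 40_000_000],
    "budget": [150_000_000, 20_000_000, 25_000_000],
})

# Rule of thumb from the report: a movie profits when gross exceeds 3x budget
films["profit"] = films["gross"] - 3 * films["budget"]
films["highly_profitable"] = (films["profit"] > 0).astype(int)
```

Note that the first film grosses twice its budget yet is still classified as unprofitable under this rule.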

To evaluate the success of our linear regression models, we initially randomly split our dataset into training and test datasets, with the test dataset containing twenty percent of the observations. We used this for initial, simple assessment of our many model iterations, but then applied 5-fold cross-validation to our developed models at each step, as this is a less biased method of estimating model MAE.
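The two evaluation schemes can be sketched with scikit-learn on synthetic data (the predictors and response below are stand-ins, not the movie dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data: 3 continuous predictors, noisy linear response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 80/20 split for quick checks of each model iteration
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold cross-validated MAE as a less biased performance estimate
cv_mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_absolute_error").mean()
```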

Our initial baseline model contained all the continuous-valued predictors in the dataset that were available before the movie was released (number of critic reviews, number of voted users, number of user reviews, and IMDB score were not used as predictors). No interactions or transformations were included. We then did a deeper dive into the best ways to interact and transform these variables manually, through intuition, insights, and close assessment. These transformations included binning, square root transformations, and so on. We also took steps to develop our model by dummying categorical variables to interact with and use in our model. Our manual process of adding and removing predictors resembled forward stepwise selection, as we started with only a couple of variables and built up to a model that performed better than the base model. Finally, we attempted lasso regression to see whether it would further optimize our final chosen subset of variables. Throughout this approach, our aim was to minimize test MAE so that it fell below both the base model’s test MAE and the mean of gross in the full dataset. The major problem we ran into was that the variables in the dataset did not seem to adequately explain the response variable. Because of this, toward the end of our project we began to explore a logistic model with a constructed response variable indicating whether a movie was highly profitable. This variable was created with our researched metric that gross minus 3x the production budget is a good, conservative measure of whether a movie will profit. The new profit column was then transformed into 1’s for highly profitable movies (profit > 0) and 0’s for the rest.
This model was not further optimized or developed and would be something to explore in next steps, since throughout our linear regression analysis we had difficulties fully optimizing with this dataset and our approach, which was largely drawn from the content and code we learned this quarter.

Our second objective was to develop a classification model to predict IMDB ratings. Logistic regression was chosen here because IMDB rating, while numeric, is more appropriately split into a binary classification of highly rated or low rated. This split was based on whether the movie was good (score > 7) or not, and a new column good_movie was created that held 1’s for highly rated films and 0’s for the others. We considered a movie good if its IMDB score was above 7 because the average score lies around ~6; for context, the top 250 movies ranked by IMDB are all rated above an 8. This binary classification increases the interpretability of the model for inference and for stakeholders. Additionally, there was a lack of significant correlation between the continuous variables and the IMDB rating response, and logistic regression was chosen for this reason as well.

We decided to optimize precision as well as the false positive rate. We intended this model to be used by stakeholders in the movie production industry, and thus a false positive would have a much more drastic effect than a false negative. Movie production is costly both financially and in terms of time; we therefore believed it would be in the best interests of those involved in movie production pre-release to minimize the false positive rate. Precision was also optimized so our model would accurately predict the positive class.

One issue we anticipated and ran into was the usability of the variables provided in the dataset. As our model would only be relevant to variables that can be measured before the movie is released, many variables had to be omitted from the model (e.g. gross, num_critic_for_reviews). Furthermore, there were some variables that we suspected would cause multicollinearity due to their similar natures, such as actor_1_facebook_likes, actor_2_facebook_likes, etc. These issues were later solved during model development and data preparation.

Another large issue that we ran into was the imbalance in the dataset in which only around 30% of the data consisted of good movies (which makes sense as there are many more ‘bad’ movies than ‘good’ ones). To counter this, we implemented the Synthetic Minority Oversampling Technique (SMOTE). This would artificially create data for the minority class (good_movies) that we later would train our logistic model on.

8 Developing the model

In our project we aimed to create two models. The first is a linear regression model to predict the gross income of a movie, created using manual variable selection and an attempt at lasso regression (standardization and variable selection). The second is a logistic regression model to classify movies as highly rated or low rated, created using manual variable selection as well as forward stepwise selection. The regression model development is explained below in an additive format. We aimed to minimize MAE with the continuous and relevant categorical variables in our dataset, while also balancing ease of interpretability.

8.1 Linear Regression Model Development

1. Baseline model

We began by creating a baseline model containing all the continuous-valued predictors in the dataset that were available before the movie was released (number of critic reviews, number of voted users, number of user reviews, and IMDB score were not used as predictors). No interactions or transformations were included. This baseline model produced an MAE of around 39 million on the test dataset, and an MAE of around 42 million using 5-fold cross-validation. Both values are slightly less than the overall mean of gross, 48.5 million, but we aimed to optimize the MAE further so that our predictions could be more useful to stakeholders in projecting gross and allocating funds accordingly. This baseline model also indicated that aspect_ratio might not be significant in our manual forward variable selection; the rest of the continuous variables in this model seemed to be significant.

2. Initial model for manual forward selection: facebook_likes (VIF analysis)

After conducting our EDA and observing the baseline model predictor significance, we decided to begin our model development with a regression model using the facebook_likes variables to predict gross. However, we first conducted a VIF analysis of the predictors representing facebook likes, in order to rule out any instances of multicollinearity, as these variables intuitively seem related. The VIF values indicated that actor_1_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, and cast_total_facebook_likes were jointly involved in multicollinearity, which we resolved by dropping the individual actor columns. Hence, our first model attempt contained only movie_facebook_likes, cast_total_facebook_likes, and director_facebook_likes. The model had an R-squared of 0.176, which indicated that our response was minimally explained by the chosen predictors. The p-values for the three predictors were all 0.00, however, indicating significance in this initial model of our manual forward selection process. The validation set MAE was 40 million, and the 5-fold cross-validation MAE was 43.4 million.

3. duration: binning, adding to the model

Our next focus was to explore the continuous predictor duration within our model. We observed in our EDA for regression that duration had a notable relationship with gross that might benefit from binning: in particular, gross income only began to increase significantly once movies exceeded 114 minutes. Consequently, we created 4 custom bins to best capture the generally positive relationship between duration and gross, as seen in the visualization below. After adding the binned duration dummy variables to our previous model (which contained the significant, independent facebook_likes variables), we saw a 0.031 increase in R-squared. The p-value of each duration bin was significant at a 0.1 or 0.05 alpha level. The MAE on the validation set and through 5-fold cross-validation also decreased, to 39 million and 42.7 million, respectively.
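The binning step can be sketched with `pd.cut`. The cut points and labels below are assumptions for illustration (the report only states that 114 minutes was a notable breakpoint), and the durations are hypothetical:

```python
import pandas as pd

# Hypothetical film durations, in minutes
durations = pd.Series([85, 98, 110, 121, 135, 160, 178, 95], name="duration")

bins = [0, 90, 114, 150, 1000]  # assumed cut points for illustration
labels = ["short", "medium", "long", "very_long"]
duration_bin = pd.cut(durations, bins=bins, labels=labels)

# Dummy-encode the bins for the design matrix, dropping one as the baseline
duration_dummies = pd.get_dummies(duration_bin, prefix="duration",
                                  drop_first=True)
```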

4. budget: variable transformation, adding to model

Following the addition of duration to our model, we decided to further explore the relationship between budget and gross income, specifically in relation to model assumptions as well. First, a linear regression model to predict gross using one predictor, budget, was created. This model did not explain the variance in our target response variable very well, but we fit it in order to visualize a plot of fitted values and residuals in an attempt to better understand budget and the model assumptions. This visualization is displayed below. It is evident from this visualization that budget violates multiple linear model assumptions, as the residuals are not evenly and randomly distributed around the red line. We decided to perform a square root transformation of budget before adding it to our model, along with adding budget itself as a variable. The addition of budget and its transformed counterpart resulted in a model with an R-squared of 0.380. After predicting gross using this iteration of our model, we computed an MAE of approximately 33 million using the validation set method, and approximately 34 million using K-fold cross-validation, a decrease from the previous model. All predictors, including the new budget predictors, remained significant, except for two of the duration bins, which we still decided to keep due to their relationship with the other bins.
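Adding both the raw and square-root-transformed budget is a one-liner; the budget values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical production budgets, in dollars
budget = pd.Series([5e6, 2e7, 6e7, 1.5e8], name="budget")

# Keep the raw budget and add its square-root transform, as in the report
design = pd.DataFrame({"budget": budget, "sqrt_budget": np.sqrt(budget)})
```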

5. title year and genre: preparation, dummy variables, and interaction

After adding budget to our model, we decided to begin exploring relevant categorical variables, beginning with genres as a potential predictor of gross. Our EDA, after data cleaning, revealed that the highest correlated genres with gross were Adventure, Action, Family, Animation, and Fantasy. We decided to add these genres to our previous model, to see if they would improve its performance. The addition of these selected genres increased our R-squared and were significant additions. They further decreased our MAE on the validation set and through K-fold cross validation, but we wanted to move a step further by introducing a new continuous variable to interact with these genres based on the intuition that the most popular and prolific genres of movies might be related to the title_year in which the movie was released.

When considering title_year as a potential predictor for gross income, we initially assumed that recent release years would exhibit higher grosses, due to possible inflation or differences in economic contexts over time. This assumption was not necessarily confirmed by our EDA visualizing year and gross, but there was indeed a pattern in the plot of gross vs. title_year. As a result, we decided to add title_year to our model and to try interacting it with the pertinent genres, based on the intuition stated above. These additions (the five most correlated genres and title_year, along with their interactions) decreased our validation set MAE to around 32 million, and the K-fold cross-validation MAE to 33.5 million.
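The genre and title_year interactions can be sketched with the statsmodels formula interface; the data below is synthetic, reusing the report's column names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with the same column names as the report's predictors
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "gross": rng.gamma(2.0, 2.5e7, n),
    "title_year": rng.integers(1990, 2017, n),
    "Adventure": rng.integers(0, 2, n),
    "Action": rng.integers(0, 2, n),
})

# `*` in the formula adds main effects plus title_year x genre interactions
model = smf.ols("gross ~ title_year * (Adventure + Action)", data=df).fit()
```

Only two of the five genres are shown; the remaining genre dummies would enter the formula the same way.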

6. Adding country bins to model

The majority of movies in our data were produced in the United States and the United Kingdom. It is also apparent that these two countries have distinctly higher means for gross over time based on the plots in our EDA. We decided to add two predictors, for the movies with origins in the USA and UK, to our previous model. We also interacted these terms with budget, due to the fact that movies from different countries might be exposed to different economic environments and constraints. We then saw a further increase in R-squared with these predictors being classified as significant, and a decrease in MAE on our test data—30 million for the validation set method and 29 million for the K-Fold cross validation method, which are less than the baseline model metrics, as well as the means of gross from the original dataset (48 million). Most predictors remained significant at a 0.10 alpha level.

7. Final model

Based on this manual forward selection of variables, and trials of different variable additions, interactions, and transformations throughout the quarter, the model that possessed the lowest MAE and explained the variance in the response with the greatest success (R-squared of 0.508) was the model that added country and its interaction with budget to all previously added predictors. We then removed influential points from the training dataset used to fit this model and refitted it, lowering the validation set MAE slightly, by about 100,000. In total, the final model used 23 predictors, including all interaction and transformed terms; they are listed with their coefficients in the model summary in the report code. Again, this model resulted in the lowest MAE of all our model versions. We note that as we added new predictors, the p-values of some existing predictors increased. In future development, we would hope to obtain low p-values for all predictors to increase interpretability.

After all the steps explained above, we ended up with a subset of predictors containing all the continuous variables except aspect_ratio, as well as the dummy variables for the 5 genres most correlated with gross and the two most prolific countries. We attempted to use this subset in lasso regression in order to aid interpretability and discover the most contributing or “important” predictors in the model. While the MAE of the lasso-fitted model was not the minimum of the MAEs calculated previously, this regression method suggested that director_facebook_likes was the most important predictor.
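The standardize-then-lasso step can be sketched with a scikit-learn pipeline on synthetic data, where only the first two of five predictors actually drive the response; the lasso should shrink the irrelevant coefficients to (near) zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two predictors matter
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Standardize first so the L1 penalty treats every predictor on the
# same scale; otherwise large-scale predictors are penalized less
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
coefs = pipe.named_steps["lasso"].coef_
```

Predictors with nonzero coefficients after shrinkage are the ones the lasso deems important, which is how the report's conclusion about director_facebook_likes was reached.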

Our final linear regression model, while explaining the largest percentage of variance in the response, was still not ideal; for a near-perfect predictive model, our MAE would need to be much lower, despite being under the mean of gross for the whole dataset. (It was also under the mean of budget, which decreases the odds that our predictions would be off by a whole margin of possible profit, given our definition of a profitable movie as one grossing more than 3 times its budget; this was not our target threshold for the model, just general context we kept in mind during development.) Noting these shortcomings, we found that the predictors in this dataset might not be adequately explanatory for a linear regression model and might be better suited to a logistic model. We began to explore this by creating a profit column defined as gross - (3 * budget), transforming positive values of profit into 1’s and the rest into 0’s to use as a logistic response variable. However, as this was a deviation from our initial objective of predicting the continuous values of gross, we instead focused our efforts on the logistic model we initially set out to optimize: a model to predict highly rated movies on IMDB.

8.2 Logistic Regression Model Development

Our second model is a logistic regression model that predicts whether a movie will be highly rated on IMDB. We believed that developing a model with a low false positive rate would be best for our clients, since it would decrease their chances of producing a film that is predicted to succeed but fails. Before developing our model, we used SMOTE to balance our data. SMOTE is a technique that reduces imbalance in datasets by creating artificial data values, providing roughly equal numbers of good and bad movies. Our initial model used a formula that included all predictors provided in the dataset. The second model was improved by keeping any predictor that was statistically significant and removing any that was not. Our third model included some interaction terms, and our last model was finalized using a variable selection algorithm and optimizing the decision threshold probability.

1. Baseline Model

Our first model was an essential step in understanding the relationship between the predictors and IMDB scores. By including all predictors, we established a baseline for predicting the score of a movie. Furthermore, we did not include any interaction terms, to truly capture the most basic and fundamental model. This approach allowed us to explore which predictors were significant, and provided a starting point for future model development. Additionally, the first model served as a reference point for comparing the performance of subsequent models. By establishing this baseline, we were able to assess the incremental improvement of each subsequent model and determine which approach would best predict the outcome variable. With this model, we reached a classification accuracy of 76.3% and a false positive rate of 6.5%. Although the false positive rate was already surprisingly low, we knew that we could optimize it further by improving different aspects of the model.

2. Second Model - Manual Variable Selection

In our second model, we refined our approach by manually selecting variables that were statistically significant. By examining the p-values of the predictors from the first model, we determined which predictors were significant in predicting IMDB scores. We kept the predictors with low p-values, which showed the clearest evidence of an effect on the model, and removed all others, leaving the following predictors: “duration”, “director_facebook_likes”, “actors_facebook_likes”, and “budget”. This approach helped simplify the model and reduce the risk of overfitting: removing non-significant predictors reduced the number of variables in the model, making it less complex and less likely to overfit. This allowed the model to generalize more effectively to new data and reduced the chances of falsely predicting a high IMDB score for a movie. This development lowered our classification accuracy to 76.0%, but we were also able to lower our false positive rate to 4.3%. Although classification accuracy fell, we considered this an improvement of the model, since lowering the false positive rate brought us closer to our ultimate goal of avoiding false hope for movie producers.

3. Third Model - Interaction and Transformations

In our third model, we introduced interaction terms and transformations, which allowed us to model the joint effects of predictors on the response variable. First, we interacted “director_facebook_likes” with “actors_facebook_likes”, since we believed these two variables would influence each other’s effect on the IMDB score. By incorporating this interaction term, we were able to capture more nuanced relationships between variables and potentially improve the accuracy of our predictions. Additionally, we applied a logarithmic transformation to “budget” because of its large scale. By including interaction terms and transformations, we could account for the joint effects of correlated predictors, and we increased our classification accuracy to 77.6%. However, our false positive rate increased slightly to 4.9%. Compared to our baseline model, we both increased the classification accuracy and decreased the false positive rate.
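The two feature transformations described above amount to one multiplication and one log. The column names mirror the dataset; the values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the dataset used in the report.
df = pd.DataFrame({
    "director_facebook_likes": [500.0, 0.0, 20000.0],
    "actors_facebook_likes":   [12000.0, 900.0, 45000.0],
    "budget":                  [20_000_000.0, 1_500_000.0, 150_000_000.0],
})

# Interaction term: joint popularity of director and cast.
df["director_x_actors"] = (df["director_facebook_likes"]
                           * df["actors_facebook_likes"])

# Log transform to tame budget's scale (budgets here are strictly positive).
df["log_budget"] = np.log(df["budget"])
print(df[["director_x_actors", "log_budget"]].round(2))
```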

4. Final Model - Best Subset Selection / Forward Stepwise Selection / Optimize Decision Threshold Probability

For our fourth model, we used best subset selection and forward stepwise selection, and also optimized the decision threshold probability. Using best subset selection, we identified a subset of predictors with the highest potential for predicting IMDB scores. Forward stepwise selection then let us select, from that subset, the predictors that best predicted the response variable. This selection approach helped optimize the model by retaining only the most important variables and reducing overfitting: with fewer predictors, the model is less complex and generalizes better to new data, improving the accuracy of predicting a movie’s IMDB score. Finally, we determined a decision threshold probability that would optimize the false positive rate. These three steps led to our final model, which reached a classification accuracy of 73.4% and a false positive rate of 0.9%. Compared to the first model, our classification accuracy decreased slightly, but our false positive rate decreased drastically. Therefore, if this model predicts that a film will be good, stakeholders can feel reasonably safe producing the movie.
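The threshold-optimization step can be sketched as a sweep over candidate thresholds, keeping the one that minimizes the false positive rate subject to an accuracy floor. The data and the 0.70 accuracy floor below are illustrative assumptions, not the values from our analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=400) > 0).astype(int)

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Sweep thresholds; pick the lowest FPR among those meeting an accuracy floor.
best = None  # (threshold, fpr, accuracy)
for t in np.linspace(0.05, 0.95, 19):
    pred = (prob >= t).astype(int)
    fp = np.sum((pred == 1) & (y == 0))
    tn = np.sum((pred == 0) & (y == 0))
    acc = np.mean(pred == y)
    fpr = fp / (fp + tn)
    if acc >= 0.70 and (best is None or fpr < best[1]):
        best = (float(t), float(fpr), float(acc))

print(best)
```

Raising the threshold makes positive predictions rarer, so the false positive rate falls while accuracy usually gives some ground, which is the trade-off our final model accepts.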

9 Limitations of the model with regard to inference / prediction

Beyond the predictors available in our dataset to explain gross income and IMDB rating, there are other factors that contribute to a movie’s success today, which limit the practical implementation of our model(s). For one, the movies in our data were released before the COVID-19 pandemic. During the pandemic, the movie industry underwent a notable shift away from in-person theaters. This hindered the revenue of international box offices, or forced them to adapt in ways not captured in our dataset, which was compiled before 2020. On the other hand, the pandemic might have led more people to participate in rating platforms such as IMDB, which could also affect the results of our logistic regression model predicting whether a movie is likely to be highly rated. Hence, while our model offers some insight into the qualities associated with a movie’s success, some of these qualities might be more or less important in post-pandemic society.

Additionally, the variables in our data do not encompass the artistic elements of a movie, which are often essential to audience opinion and thus also contribute to gross income from box offices. When a person reviews a movie on IMDB, their rating is typically accompanied by a description of the movie, including its aesthetics, plot, characters, and themes. These are all components that are not readily usable in a statistical model, and hence our models cannot be the sole tools for predicting a movie’s success.

Our models have the advantage that the variables used for prediction and inference are easily obtainable before the release of a movie. Most producers, directors, and studios have free rein over, and full knowledge of, the release year, chosen actors/actresses, budget, etc. However, as explained earlier, this comes with the cost that these predictors might not be the most suitable for a model predicting quantitative financial data. Both models can nonetheless be used as soon as the stakeholder has access to all the information, which should certainly be before the release date and the final decisions on budgeting and marketing costs.

10 Conclusions and Recommendations to stakeholder(s)

According to our final regression model for gross income, an increase in any of the facebook_likes variables is associated with an increase in gross. This association is supported by the p-values of movie_facebook_likes, cast_total_facebook_likes, and director_facebook_likes, which are all below an alpha level of 0.05 in our model, so the null hypothesis that they have no relationship with the response is rejected. As a result, we can generally recommend that movie producers consider hiring popular actors (those with more facebook_likes), as they have the potential to increase a movie’s gross. While popular actors might cost more to employ, spending more on the quality of actors is likely beneficial for a movie’s gross, which could, in theory, translate to higher profitability depending on the exact financial numbers. More specifically, and perhaps less intuitively, so is having a well-known director with more facebook likes. The coefficients are 776.0382 for movie_facebook_likes, 249.5284 for cast_total_facebook_likes, and 731.5793 for director_facebook_likes. Since these variables are on the same scale, this suggests that while more popular actors/actresses might increase gross, the popularity of the director may have an even greater effect on box office grosses. The marketing and pre-release popularity of the movie on social media also have a positive association. Thus, optimizing these variables could maximize gross, which, without changing budgets and spending, would theoretically maximize profits.

We have also observed that an increase in movie duration is associated with an increase in gross. However, across the three duration bins, gross has the smallest positive increase in the 3rd bin, which contains the longest movies. Hence, we advise movie directors and producers to be mindful of duration: a movie over two hours might not generate as high a gross as a movie of medium length (approximately between one hour and one hour and thirty minutes). This analysis of these predictors should also be approached carefully, as the two duration bins for shorter movies do not necessarily appear significant at a 0.05 alpha level.

We also examined the budget-related predictors in the linear model for gross. When a movie originates from the US or the UK with the same budget, holding all other variables constant (except those related to country and budget), US movies tend to see a greater increase in gross, UK movies a smaller increase, and movies from other countries may see a decrease. In particular, if the budget is increased by one unit from the dataset’s mean budget, both the UK and US see about a unit increase in gross, whereas other countries see essentially no change. However, because the relationship between budget and gross includes a square-rooted term, it is not necessarily linear or constant. We thus generally conclude that for the US and UK an increase in budget is most likely associated with an increase in gross, with US movies grossing the highest in general. Stakeholders can keep this relationship in mind depending on the country in which they are making their movie.

Finally, Action movies are statistically significant in increasing gross, so if stakeholders are purely interested in increasing their box office gross income, creating a movie that includes the Action genre would be beneficial, based on our model.

Unlike the conclusions from the regression model, we found that in the logistic model a decrease in many of the variables would give a movie a higher chance of scoring well. This goes to show how unpredictable a movie’s reception by audiences can be. Our final model revealed that one of the most significant predictors of whether a movie would do well was the country of production: movies produced outside of the US tended to fare better in terms of IMDB score. Furthermore, a decrease in aspect ratio also increases the likelihood of a movie having a higher IMDB score. Two other predictors with a substantial impact on the odds were facenumber_in_poster and actors_facebook_likes. Specifically, a one-unit increase in facenumber_in_poster yielded a 0.11 decrease, and a one-unit increase in actors_facebook_likes a 0.2 decrease, in the odds of a movie being classified as a ‘good’ movie.
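Odds interpretations like those above come from exponentiating logistic regression coefficients: a coefficient b means a one-unit increase in that predictor multiplies the odds by exp(b). The coefficient value below is illustrative only, not our fitted estimate:

```python
import math

# Hypothetical coefficient for facenumber_in_poster (illustration only).
b_facenumber = -0.12

# A one-unit increase in the predictor multiplies the odds by exp(b).
odds_multiplier = math.exp(b_facenumber)
print(round(odds_multiplier, 3))  # 0.887, i.e. odds shrink ~11% per extra face
```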

However, it would not be practical for movie investors and producers to deliberately choose the country in which a movie is produced, or to seek out actors with fewer facebook likes. As mentioned earlier, facebook likes increase gross, which is something stakeholders would certainly welcome. Thus, our immediate recommendation is instead for stakeholders not to be discouraged from producing a foreign movie, even though the US is where most influential movies are produced. Furthermore, while the popularity of the cast and director may seem to affect how well a movie is received, movie production is ultimately highly qualitative and creative, so stakeholders should not be discouraged from debuting a movie with less famous actors. As for aspect ratio, older movies had smaller aspect ratios, so this variable should not be treated as relevant to how well a movie will do.

One final important thing to note is that IMDB scores may be flawed for bad movies, as they do not take into account voter credibility or the number of votes. Lesser-known movies may have fewer reviews and thus skewed IMDB scores.

GitHub and individual contribution

Link to GitHub Repository

Individual contributions

Team member Contributed aspects Details
Lila Weiner Data cleaning, EDA, linear regression model development, expository Developed visualizations to identify appropriate binning transformations, created EDA scatterplots for linear regression, linear regression model development, co-wrote all sections pertaining to linear regression as well as introductory sections, etc.
Gabrielle Bliss Data cleaning, EDA, assumptions and interactions, linear regression model development (lasso, removing influential points), expository Cleaned data to create dummy variables for genres, worked on transformations for budget, genre, and title year, completed VIF analyses, used KFoldCV function to assess accuracy, attempted lasso, co-wrote all sections pertaining to linear regression as well as introductory sections, etc.
Riley Otsuki Data cleaning/prep, Logistic model development Cleaned and prepped data for logistic model development. Created dummy variables for country and implemented oversampling through SMOTE. Improved model by implementing forward stepwise selection. Co-wrote all sections pertaining to logistic regression and related materials in the intro.
Kenneth Yeon Logistic model development, manual variable selection, predictor interactions, predictor transformations, calculating odds. Developed the first three models, and noted their improvements from the baseline model. Used manual variable selection, interactions, and transformations to increase classification accuracy and decrease false positive rate. Calculated the odds of how changing the predictors would affect the predicted IMDB score.

Link to GitHub Insights and Commits

Gabby: I think I am still a little rusty on the branching aspect of GitHub and how to successfully commit and push to the main. It was easiest for me to just push to the main while coordinating with my group separately about when I was doing so in order to make sure we were not losing work. For a while, my GitHub desktop was giving me issues, and so I had Lila commit my new files for me by sending them over to her. While it was not the easiest of roads, I think I am more comfortable with GitHub as we are ending the quarter. I think I could definitely use GitHub on my own with a branch; the stress of making sure we were coordinating pushes, pulls, and merges accordingly as a group was the most concerning part about using the platform. In my opinion, GitHub did make things easier in the end, but I can see how, if the whole group was 100% comfortable using it, it would be even more useful.

Lila: At the beginning of the quarter, I was apprehensive about using GitHub, as the platform was unfamiliar to me. However, after watching the introductory videos posted to Canvas, I was more comfortable working on my own branch, and merging my progress with the main. I think it would have been beneficial to have GitHub incorporated into the weekly assignments somehow, in order to gain more experience.

Kenny: I had some initial issues with GitHub, but I was able to quickly learn how it works. One of the main challenges I faced was when GitHub would not allow me to merge my changes into the main branch due to merge conflicts. Although I have used GitHub a lot through this project, I would still say that there are some aspects of GitHub that I am not entirely comfortable with. However, I do feel that it made collaboration easier.

Riley: Like my group members, I also found it challenging to merge my own files into the main branch. However, I now think that I am quite comfortable using GitHub and find it helpful for collaboration. I am still not comfortable enough to figure out why a file sometimes cannot be merged. Looking at the conflicts in GitHub, there seems to be an issue when merging if our code was run a different number of times within Jupyter Notebook. While re-running the entire notebook seemed to fix it, I am still eager to learn more about GitHub to make it a much more seamless experience.

References

[1] Thompson, Derek. “How Hollywood Accounting Can Make a $450 Million Movie ‘Unprofitable’.” The Atlantic, Atlantic
            Media Company, 14 Sept. 2011, https://www.theatlantic.com/business/archive/2011/09/how-hollywood-accounting-
            can-make-a-450-million-movie-unprofitable/245134/.

[2] Anders, Charlie Jane. “How Much Money Does a Movie Need to Make to Be Profitable?” Gizmodo, 31 Jan. 2011,
            https://gizmodo.com/how-much-money-does-a-movie-need-to-make-to-be-profitab-5747305.

Appendix

10.0.0.1 Data Quality Check, Categorical Variables
Top 3 Value Counts, Categorical
color Color 4815
Black and White 209
director_name Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
actor_2_name Morgan Freeman 20
Charlize Theron 15
Brad Pitt 14
genres Drama 236
Comedy 209
Comedy|Drama 191
actor_1_name Robert De Niro 49
Johnny Depp 41
Nicolas Cage 33
movie_title Ben-Hur 3
Pan 3
King Kong 3
actor_3_name John Heard 8
Ben Mendelsohn 8
Steve Coogan 8
plot_keywords based on novel 4
one word title 3
assistant|experiment|frankenstein|medical student|scientist 3
movie_imdb_link http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1 3
http://www.imdb.com/title/tt3332064/?ref_=fn_tt_tt_1 3
http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1 3
language English 4704
French 73
Spanish 40
country USA 3807
UK 448
France 154
content_rating R 2118
PG-13 1461
PG 701
Null Counts, Categorical
color 19
director_name 104
actor_2_name 13
genres 0
actor_1_name 7
movie_title 0
actor_3_name 23
plot_keywords 153
movie_imdb_link 0
language 12
country 5
content_rating 303
Unique Counts, Categorical
color 2
director_name 2398
actor_2_name 3032
genres 914
actor_1_name 2097
movie_title 4917
actor_3_name 3521
plot_keywords 4760
movie_imdb_link 4919
language 47
country 65
content_rating 18
10.0.0.2 Data Quality Check, Continuous Variables
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000
null count 50.000000 15.000000 104.000000 23.000000 7.000000 8.840000e+02 0.000000e+00 0.000000 13.000000 21.000000 4.920000e+02 108.000000 13.000000 0.000000 329.000000 0.000000